Conversation
Adds support for OpenTelemetry tracing, as described in the included docs. Custom instrumentation is not yet supported.
Review requested:
legendecas
left a comment
AFAICT, it would create confusion that the built-in support does not work with the @opentelemetry/api package, which is intended to be the global provider for otel components, like propagators and exporters.
I personally find the opentelemetry APIs really cumbersome to work with. Maybe there is something better that can be done. But I also concur that some kind of high-level compatibility API would need to exist.
It would be really nice to get ahold of https://www.npmjs.com/package/otel and add this new core module without the prefix.
```js
otel.start({ endpoint: 'http://collector.example.com:4318' });
// ... application code ...
otel.stop();
```
More a style nit... feel free to ignore. I find this start()/stop() pattern a bit cumbersome. I'd be happier with something like...

```js
const stop = otel.start();
stop();
```

which also allows for...

```js
{
  using session = otel.start();
  // automatically stops
}
```
While `using` is great, I don't think it's actually appropriate here. This is for enabling and disabling an entire subsystem. stop() would probably actually be pretty rare, probably only used for some kind of error recovery / process teardown. It definitely doesn't fit the usual use case for `using`, where we'd expect a resource to be disposed at the end of a variable scope.
```markdown
* `options` {Object}
  * `endpoint` {string} The OTLP collector endpoint URL.
    **Default:** `'http://localhost:4318'`.
  * `filter` {string|string\[]} Optional comma-separated string or array of
```
Nit: can we pick one? I'd prefer just string[].
It's the way it is for consistency with the environment variable. It could be an array here in the API and then comma-separated in the environment variable.
Agree that an OpenTelemetry SDK is not built in to the Node.js instrumentation. However, this implementation is not fully compliant with the OpenTelemetry design. The instrumentation should not implement the OpenTelemetry SDK components, like exporters and the tracer SDK. The three OpenTelemetry layers, API, SDK, and instrumentations, have clear scopes and allow open customization without any vendor lock-in. That is, instrumentations will use the API to create spans and metrics, but do not directly depend on the SDK. This allows instrumentations to be vendor-agnostic, and observability vendors can provide their own SDK (or just an exporter) to do whatever proprietary work is needed, like exporting. I think a Node.js built-in
Qard
left a comment
Some small questions/ideas, but otherwise LGTM!
```markdown
### `OTEL_SERVICE_NAME`

Standard OpenTelemetry environment variable used to set the service name in
exported resource attributes. Defaults to `node-<pid>`.
```
We already have to load the package.json during startup, don't we? So perhaps we could make this use the name field from it, if present, and fall back to that?
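The suggested fallback chain could be sketched like this (the function name and argument shapes here are illustrative, not the PR's actual code):

```javascript
// Hypothetical resolution order for the service name: explicit env var,
// then package.json "name" (already loaded during startup), then the
// current `node-<pid>` default.
function resolveServiceName(env, packageJson, pid) {
  return env.OTEL_SERVICE_NAME ||
         (packageJson && packageJson.name) ||
         `node-${pid}`;
}
```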
```js
let timeOriginNs;
function getTimeOriginNs() {
  if (timeOriginNs === undefined) {
    // getTimeOriginTimestamp() returns ms since Unix epoch.
    // Multiply by 1e6 to get nanoseconds.
    const originMs = getTimeOriginTimestamp();
    timeOriginNs = BigInt(MathRound(originMs * 1e6));
  }
  return timeOriginNs;
}

function hrTimeToNanos() {
  const relativeMs = now();
  const ns = getTimeOriginNs() + BigInt(MathRound(relativeMs * 1e6));
  return `${ns}`;
}
```
I'm wondering a bit about the perf of this. Did you try a high/low pair like process.hrtime() returns to be able to do the 64-bit timestamp rather than this string approach? I'm a bit worried about the perf implications of lots of bigint timestamps being generated all the time. 🤔
The OTEL hrtime, as far as I have been able to determine, is not compatible with process.hrtime. It is based upon performance.timeOrigin -- https://github.com/newrelic/node-newrelic/blob/39d1d9a59ca467d71d4a7db4cf437d743a24939b/lib/otel/normalize-timestamp.js#L56-L85
Aside from performance concerns, is it a good idea to depend entirely on a non-monotonic clock? This may result in negative durations or out-of-order spans.
Also, shouldn't it be BigInt(MathRound(originMs)) * 1_000_000n to avoid loss of precision?
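The concern can be demonstrated with a small sketch (the timestamp below is made up; it is not from the PR). A ms-since-epoch value scaled to nanoseconds exceeds `Number.MAX_SAFE_INTEGER`, so converting to BigInt *after* the float multiply can round away low-order digits:

```javascript
const originMs = 1735689600123; // example integer ms timestamp

// Multiply first, then convert: the product (~1.7e18) is far above 2^53,
// so the nearest representable double differs from the true product.
const lossy = BigInt(Math.round(originMs * 1e6));

// Convert first, then multiply in BigInt space: exact by construction.
const exact = BigInt(originMs) * 1_000_000n;

// `lossy` and `exact` can disagree in their low-order digits.
```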
I don't mean using process.hrtime itself, I mean using the pattern of separating high/low parts up to the encoding edge so we don't need to be operating on bigints. The public API surface could also convert to whatever types we need, if we want to duplicate official OpenTelemetry behaviour exactly. I'm just thinking the internals could be a lot better about perf.
Codecov Report

```
@@            Coverage Diff             @@
##             main   #61907      +/-   ##
==========================================
+ Coverage   89.73%   89.75%   +0.01%
==========================================
  Files         675      680       +5
  Lines      204674   205883    +1209
  Branches    39330    39565     +235
==========================================
+ Hits       183667   184781    +1114
- Misses      13283    13387     +104
+ Partials     7724     7715       -9
```
```js
    urlStr + '/v1/traces';
}

const parsed = new URL(urlStr);
```
This condition could be simplified to the following:

```js
const parsed = new URL('/v1/traces', endpoint);
```

Tests with the Node REPL:

```console
> new URL("/v1/traces", "http://localhost:4318/hello?foo=bar").href
'http://localhost:4318/v1/traces'
> new URL("/v1/traces", "http://localhost:4318/").href
'http://localhost:4318/v1/traces'
> new URL("/v1/traces", "http://localhost:4318").href
'http://localhost:4318/v1/traces'
> new URL("/v1/traces", "http://localhost:4318/v1").href
'http://localhost:4318/v1/traces'
> new URL("/v1/traces", "http://localhost:4318/v1/traces").href
'http://localhost:4318/v1/traces'
> new URL("/v1/traces", "http://localhost:4318/v1/traces/").href
'http://localhost:4318/v1/traces'
```
I have a lot of thoughts about this so I'll try to keep it succinct and direct. In the interest of full disclosure, I have been one of the maintainers of OTel JS for years and that will likely color some of my opinions. I am not on the OTel Governance Committee and the opinions here should be taken as my own and not as the official stance of the project.

I'll start with the things I like. First, I think having built-in support for OTel is great, and focusing on HTTP first makes sense. The lack of interoperability with existing instrumentation means the use of this will be very limited for now, but that's probably fine. Anyone who needs other instrumentation has the option to use the existing otel libraries.

Second, I think it is reasonable to only support OTLP. Supporting other export formats is a well-intentioned part of OTel, but it was a lot more important in the earlier days when OTLP support was limited. Now all major vendors support OTLP, and the collector serves as a mature routing layer for people who need to use other formats. I would prefer to have OTLP protobuf as an option since it is by far the most used OTLP format, but JSON makes sense as the de facto standard format for JS remote calls and I understand the desire to avoid the protobuf dependency.

Third, I do see the value in a reimplementation which is Node-specific and avoids the need for various platform-specific workarounds. Because this is shipped with Node itself, there is no need to retain workarounds for old Node versions or non-Node platforms like browsers.

Finally, this seems really easy to use. The lack of configuration, customization, and manual instrumentation definitely makes this simpler to understand and use for end users, and because it is built in there is no need to make sure your otel dependency versions align with each other.

I do have a couple of issues with the way this is implemented.
This seems to have completely done away with most of the features required by the OTel specification. For example, it is missing all of the normal extension points (configuration, propagators, samplers, id generators, span processors, exporters, resource, entities). I'm not sure if this was an unintentional side effect of vibe coding a replacement SDK or an intentional choice to "avoid bloat," but it really undercuts a lot of the value of OTel.

I might even go so far as to say it's not quite accurate to call this "OpenTelemetry support" without those things. It might be better to call this "OTLP span export" or something like that. At the very least I'd make sure you reach out to the OTel Governance Committee to obtain permission to use the OpenTelemetry trademark. I'm not a lawyer and not on the GC, but this is an area where I think it pays to be careful.

Finally, and this is my biggest concern, I agree with @legendecas that this seems like it would fracture the otel ecosystem on Node. I am most concerned about the complete lack of interoperability with existing otel libraries. It seems to me that there is definitely a way this could be done that would retain interoperability and allow a gradual transition to a built-in implementation without such a fracture. I understand that obviously Node can't take a dependency on an NPM module, but I'm certain the otel JS SDK would not have any issue taking a dependency on built-in modules where the experience could be improved for users.
The PR description probably could have used a bit more details. Sorry about that.

First, let me address compatibility/fracturing and support (or lack thereof) for

Second, I agree that calling this PR "OpenTelemetry support" could be misleading if users expect full API support. I'm happy to rename things as appropriate, or move the programmatic activation to some other module (though, I kept it as a new module here knowing that I'd eventually want to export a
```markdown
The tracing subsystem automatically propagates [W3C Trace Context][] across HTTP
boundaries:

* **Incoming requests**: The `traceparent` header is read from incoming HTTP
```
I think this should include tracestate handling, otherwise it's not complete from a W3C spec point of view.
With tracestate unused by this internally it may be as simple as propagating it with the span so it can be attached on client spans. No need to parse it for now.
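The "forward without parsing" idea could look something like this sketch (the function names and the context shape are illustrative, not the PR's API):

```javascript
// Carry `tracestate` as an opaque string next to `traceparent`; the W3C
// spec only requires forwarding it, so no parsing is needed for now.
function extractContext(headers) {
  const traceparent = headers['traceparent'];
  if (!traceparent) return null;
  return { traceparent, tracestate: headers['tracestate'] };
}

function injectContext(ctx, headers) {
  headers['traceparent'] = ctx.traceparent;
  if (ctx.tracestate !== undefined) {
    headers['tracestate'] = ctx.tracestate; // forwarded verbatim
  }
  return headers;
}
```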
```markdown
This module is only available under the `node:` scheme. It requires the
`--experimental-otel` CLI flag or one of the `NODE_OTEL` / `NODE_OTEL_ENDPOINT`
environment variables.
```
I think we should clearly mention the limitations of the current state, like:

- incompatible with OTel components from the OpenTelemetry project
- trace.id/span.id duplication
- likely overwrite of W3C tracecontext headers in outgoing requests

I guess a user has to decide to use either the one or the other OTel variant and avoid using both in the same Node.js process.
```markdown
> Stability: 1 - Experimental

When set to a non-empty value, enables the built-in OpenTelemetry tracing
```
Really any value? E.g. having NODE_OTEL=disable or NODE_OTEL=0 enable it seems a bit uncommon.
Quite a few other Node env vars check for the value 1.
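A stricter parse along the lines suggested might treat common "off" spellings as disabled; a sketch (the exact value list is an assumption, not current or proposed behavior):

```javascript
// Only enable when the value is not a recognized "off" spelling.
function isOtelEnabled(value) {
  if (value === undefined || value === '') return false;
  const normalized = value.trim().toLowerCase();
  return !['0', 'false', 'off', 'no', 'disable', 'disabled'].includes(normalized);
}
```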
```js
  Uint8Array,
} = primordials;

const { randomFillSync } = require('internal/crypto/random');
```
Is it required to use a cryptographically strong pseudorandom source here? Or is the perf impact not a concern?
Not required to be crypto random by spec
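Since the spec only asks for random-looking IDs rather than crypto-strength ones, a cheaper generator would be acceptable; a minimal sketch (the helper name is made up, and a real implementation should also reject the all-zero ID, which the spec reserves as invalid):

```javascript
// Math.random()-based hex ID generator; cheaper than crypto randomness,
// which the OTel spec does not require for trace/span IDs.
function randomHexId(byteLength) {
  let out = '';
  for (let i = 0; i < byteLength; i++) {
    out += Math.floor(Math.random() * 256).toString(16).padStart(2, '0');
  }
  return out;
}

const traceId = randomHexId(16); // 32 hex chars
const spanId = randomHexId(8);   // 16 hex chars
```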
```js
  }
}

function flush() {
```
Shouldn't flush return a promise to allow users to wait until all data is sent? e.g. before shutdown.
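A promise-returning flush could be sketched roughly like this (names are illustrative; `trackExport` stands in for wherever the export request is issued):

```javascript
// Track in-flight export promises so flush() can resolve once every one
// has settled, letting callers `await flush()` before shutdown.
const pending = new Set();

function trackExport(exportPromise) {
  pending.add(exportPromise);
  exportPromise.finally(() => pending.delete(exportPromise));
  return exportPromise;
}

function flush() {
  // Resolves after all in-flight exports settle, success or failure.
  return Promise.allSettled([...pending]).then(() => undefined);
}
```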
```js
}

if (isModuleEnabled('node:http')) {
  sub('http.server.request.start', onHttpServerRequestStart);
```
Better to check if it's active and the module is enabled before subscribing.
```js
// Generate W3C traceparent header value.
// Format: {version}-{trace-id}-{span-id}-{trace-flags}
inject() {
```
maybe rename to getTraceParent(). This function doesn't inject anything as of now.
```js
if (typeof value === 'boolean') {
  return { boolValue: value };
}
return { stringValue: String(value) };
```
Is there intentionally no array support?
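Array support would map to OTLP's `arrayValue` wrapping nested values; a sketch (the scalar branches mirror the snippet above, and the int64-as-string encoding follows the proto3 JSON mapping OTLP uses):

```javascript
// Convert a JS value into an OTLP/JSON AnyValue, including arrays.
function toAnyValue(value) {
  if (typeof value === 'boolean') return { boolValue: value };
  if (typeof value === 'number') {
    return Number.isInteger(value) ?
      { intValue: String(value) } :  // int64 is a string in proto3 JSON
      { doubleValue: value };
  }
  if (Array.isArray(value)) {
    return { arrayValue: { values: value.map(toAnyValue) } };
  }
  return { stringValue: String(value) };
}
```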
```js
lastExportWarningTime = now;
const suffix = exportFailureCount > 1 ?
  ` (${exportFailureCount} total failures)` : '';
process.emitWarning(
```
Maybe otel events should be emitted on the otel object instead of process.
```js
const transport = parsed.protocol === 'https:' ? https : http;

const req = transport.request({
```
Consider using an HTTP agent to keep the connection alive for some time.
```js
  }
}

function sendToCollector(body) {
```
There is no throttling, so if a lot of spans are ended and/or the connection is slow, quite a lot of parallel requests might be created.
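One possible back-pressure shape, capping in-flight exports and queueing the rest (purely illustrative; `send` stands in for the actual HTTP call, and a real implementation might drop-and-count instead of queueing unboundedly):

```javascript
const MAX_IN_FLIGHT = 4;
let inFlight = 0;
const exportQueue = [];

function sendThrottled(body, send) {
  if (inFlight >= MAX_IN_FLIGHT) {
    exportQueue.push(body); // defer until a slot frees up
    return;
  }
  inFlight++;
  send(body).finally(() => {
    inFlight--;
    if (exportQueue.length > 0) sendThrottled(exportQueue.shift(), send);
  });
}
```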
```js
request[kSpan] = span;

const storage = getSpanStorage();
storage.enterWith(span);
```
Why do request[kSpan] if you're also using AsyncLocalStorage? Is this just a performance optimization to avoid calling ALS again on response finish?
A very welcome goal. We are extremely interested in this moving forward. We haven't even technically released/announced our support for bridging OTEL signals into our agent, and we have already had complaints about the insane bloat the OTEL dependencies add in disk size, and complaints about the memory impact of just loading them. But we really don't have the time to go off and re-implement the OTEL API+SDK just to elide all of the things the reference implementations bring in.

If we can set up a GitHub project of issues to resolve, that would make it far easier for those interested to contribute and get this feature fully implemented.
Actually, Node.js can take a dependency on an NPM module. There are precedents like
Thank you for clarifying on this! I really appreciate that fracturing is not the goal of this PR. It was really confusing that this PR redid every OpenTelemetry API/SDK/Instrumentation piece from scratch, but incompletely, despite the existence of the official OpenTelemetry JS implementation.
The official open-source OpenTelemetry JS impl implements the OpenTelemetry standard specification to support a variety of data pipelines and configurations. If the goal of the Node.js OpenTelemetry support is to be fully OpenTelemetry spec compliant, it is really confusing how this PR and the follow-up work would "avoid bloat" in the end, essentially replicating the OpenTelemetry JS impl inside Node.js. There are precedents that an npm package can be bundled in Node.js, and
```js
request[kSpan] = span;

const storage = getSpanStorage();
storage.enterWith(span);
```
It seems this enterWith() leaks as there is no enterWith() using the previous store.
```js
);

span.setAttribute('http.request.method', method);
span.setAttribute('server.address', request.origin || '');
```
Maybe better to set the attribute only if we have an actual value, instead of sending an empty string.
I'm pretty sure the empty string is consistent with the reference implementation. But it's too difficult to find to link to it.
```markdown
subsystem and directs spans to the specified OTLP/HTTP collector endpoint. The
`/v1/traces` path is appended automatically. Also makes the `node:otel` module
```
If a full URI with path is provided, then /v1/traces should not be appended.
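That behavior could be sketched as follows (this is the suggested behavior, not the PR's current behavior, and the function name is made up):

```javascript
// Append /v1/traces only when the caller did not supply an explicit path.
function resolveTracesUrl(endpoint) {
  const url = new URL(endpoint);
  if (url.pathname === '/' || url.pathname === '') {
    return new URL('/v1/traces', endpoint).href;
  }
  return url.href; // full URI with a path: use it as-is
}
```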
```js
const SPAN_KIND_INTERNAL = 1;
const SPAN_KIND_SERVER = 2;
const SPAN_KIND_CLIENT = 3;
```
It would be helpful to have a link to the source of these magic constants.
I assume they can be found somewhere in the OTLP spec.
```js
function onHttpClientRequestCreated({ request }) {
  if (!isActive() || !isModuleEnabled('node:http')) return;

  if (request.getHeader('host') === getCollectorHost()) return;
```
Maybe we should use a more fine-grained exporter filter here, as the same host might serve more endpoints, and some of them might be of interest.
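A finer-grained self-trace filter could match on host *and* path prefix, so other endpoints on the collector's host still get traced; a sketch (the `collector` config object and its fields are illustrative):

```javascript
// Skip tracing only for requests that actually target the collector's
// OTLP endpoints, not everything on the same host.
function isCollectorRequest(request, collector) {
  return request.getHeader('host') === collector.host &&
         typeof request.path === 'string' &&
         request.path.startsWith(collector.pathPrefix);
}
```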
I have an equivalent of https://github.com/datadog/pprof-format for OTLP that I'd like to add to this at some point, when I get to finishing it, so that may get resolved before too long. 🙂
In my opinion it's a feature not having all that. The majority of users don't actually need most of those things, and there's very real overhead to needing to delegate across this complicated cluster of types in hot code. A minimal API with sensible defaults makes a lot more sense to me.
There's a sort of irony that the thing which purports to prevent vendor lock-in is so rigidly unable to allow using just a subset of its features. In my opinion the full OpenTelemetry API exists at a level which caters to end users, not to implementors. To me, arguing that the OpenTelemetry API should exist in core is like arguing that the express router should exist in core: it is an opinionated user-facing API surface, not core infrastructure. It is the wrong level for the actual implementor concerns; this is what diagnostics_channel was created for.

Having some simplistic support for producing OTLP in core makes sense to me, but I personally don't actually see the value in having the entire API in core, especially when it already exists and is completely usable as-is in userland. I do respect that OpenTelemetry seems to have "won" the race and achieved critical mass in adoption, but I think just directly absorbing the decisions it has made is a mistake. We should be focusing on what parts of its internals suffer from the limitations of being in userland and fixing those problems more surgically. Over time it's likely more of the infrastructure would accumulate in core and we may eventually end up with the whole API, but just pulling it all in at once is likely to just rot and takes the focus away from finding the actual problems which would convince one that there is benefit to being in core.

I'm not generally one for the small-core ethos at this point; it's good to have a full toolbox. But I would be hesitant to absorb such a strongly opinionated API without deeply considering why each decision was made and whether that is a decision we want to be making in core.

I realize the message prior to this may sound a bit harsh, which I don't intend it to be. I would like to see at least OTLP export in core, and I very much understand the utility of the broader capability which OpenTelemetry provides.
I just think we should exercise caution in pushing for the entire API rather than starting small and building on top with small building blocks. If we're going to adopt a significant new API surface, that should involve significant engineering review and deep thinking about the implications. I'm quite happy this went for the minimal surface area it did, just to get a usable tool in there which can be expanded to enable greater extensibility later. I would very much like to see the scope of this not bloat significantly to try to cover every use case of OpenTelemetry. I would much prefer we take the time to learn and discover what actually needs to be in core to enable the ecosystem around observability on Node.js to thrive.
Please note that OpenTelemetry does define specifications for language APIs, SDK interfaces, semantic conventions, and OTLP as well, as part of its standard: https://github.com/open-telemetry/opentelemetry-specification/tree/main/specification. OpenTelemetry language maintainers can leverage language-specific designs, but the general API interfaces are subject to the OpenTelemetry specifications. This allows homogeneous data pipelines and data structures on the client side and defines the assumptions about how to consume OpenTelemetry data.
```js
span.setAttribute('http.request.method', method);
span.setAttribute('server.address', request.origin || '');

const url = `${request.origin}${request.path}`;
```
Above, `request.origin` is defaulted to `''`; here this would default to `undefined`.
I agree that there is no need to pull in the full OTel SDK/API, and for sure not in the initial PR. If parts like an OTLP exporter or HTTP instrumentations are in core, it would be really nice to have them available also for usage with the OTel SDK, to avoid users ending up with 1.75 OTel SDKs in many Node.js processes.
Yes, I'm very aware of the depth of specification defined by OpenTelemetry. That's exactly what makes me hesitant about it. The specification down to cross-language API surface is heavy-handed. All that ultimately matters to implementors is that they can produce the same shape and semantics of output. The OTLP format specification alone is all that was ever really needed for implementors. The OpenTelemetry API imposes a lot of API design decisions which are not idiomatic or even performant across all languages. It also encodes a lot of baggage into its specification by having set its specification in a time before a particular way to use it had become dominant.

OpenTelemetry is a specification built around integration, not interoperability. Meaning it takes every idea it likes and encodes it directly into one monolithic specification for the entire category of Observability, rather than defining narrower specifications for the different components of the architecture to be interoperable with each other, leaving room for alternative interfaces at those interoperability points. Consider HTTP specifications: we were able to define Server-Sent Events as a layer over the cluster of specifications which define HTTP because it was built to be interoperable. Despite the concepts applied by SSE not existing when HTTP was defined, it was able to layer over HTTP reasonably cleanly by simply having the flexibility and interoperability to do so. OpenTelemetry does not have this. It defines a complex end-to-end model which leaves little room for interoperability with anything not integrated into its specification directly.

Just because something has a specification doesn't mean it's good or that others should be encouraged to follow it. We should examine what exactly core and the ecosystem gain from shipping that in core, as opposed to continuing to let it flourish in userland where it has been doing just fine. We should try to articulate what actually is the benefit of putting it in core.
Do we want an out-of-the-box experience? If so, would it not make more sense to have a sensibly-defaulted and simplified thing? Is it to address some performance or design constraints? Perhaps those specific things should be raised to figure out how they can be addressed more directly, rather than assuming that mere presence in core would resolve whatever concerns exist? I've heard many with a vested interest in OpenTelemetry stating it as a given that it should just be in core, but I have not heard a single explanation as to why it is important that the entire API surface exists in core. That's not to say it should not be, but rather that we should seek much greater clarity on exactly what and why.
In my personal view: because of the reasons you stated above. The reference implementation is a real pain to work with and seems way over-engineered. I would much rather see a simple implementation of the API+SDK in core as a standard, so that we can rely on it to generate spans/whatever for any library that wants to generate OTEL data; right down to accepting the registration of custom exporters and the like.
Don't worry about it. I find direct feedback a lot easier to read anyway and you're not going to hurt my feelings. Nobody is more aware than me of the shortcomings of OTel JS and I don't delude myself into thinking everyone's experience is all roses.

FWIW I also agree that the entire OpenTelemetry SDK does not need to be in core. The separation between API and SDK was meant to allow library authors to take a dependency on the API without taking a dependency on the SDK, so that the end user can insert an SDK implementation that made sense for them. Some vendors have taken advantage of this in multiple languages to publish stripped-down SDKs with fewer features, or dropping support for old/alternative runtimes, to gain other advantages like performance, memory use, or deployment size. When they're built into the runtime, that distinction loses a lot of value. IMO, the vendor neutrality of OTel comes from OTLP much more than it comes from custom SDK implementations and the API/SDK split.
It seems to me like you're referring to what OTel calls the SDK, not the API. More details in the next section, but the terminology is important to make sense of what I'm about to say.

This is where the API/SDK split shines IMO. These features may be only useful to a subset of the community, but that isn't the same thing as saying they're useless. I think this part is where the biggest risk of ecosystem fracture comes in. If Node builds in an API, people will use it. If that API goes into a blackbox SDK not accessible to users, like in this implementation, it will be impossible for users to define these things. Some users who do need it would then use the OTel SDK but would receive a significantly degraded experience. If there is 2-way interop between the built-in API and the OTel API we could avoid this, allowing users who need the full suite of features of the OTel SDK and users who only need the minimal built-in version to share the same instrumentations and avoid too much duplication of work. In the ideal case, the upstream SDK would use the built-in versions of the basic features where possible and only implement the "power user" features.

WRT reducing API surface, I assume you're referring to the SDK, not the API. In OpenTelemetry parlance, the tracing API itself is not a massive API surface. The way it is implemented results in more code than necessary, and if I built it again today I'd make different decisions (which we have the freedom to do here). A lot of that code focuses on cases that aren't important here (multiple different API versions in one process, instrumentation startup order, etc.) and can be removed or reworked if there's a built-in API. Modern features like diagnostics channels that didn't exist at the time also provide some breathing room. The API itself is only a

You may be interested in a POC I wrote as an example of "if I built it again today I'd make different decisions."
It uses a simplified clone of diagnostics channel (in order to be usable on the web also) and results in a DRASTIC reduction in overall code. This also focuses on retaining existing API features like the Trace API (with the exception of the channel and some shared types, this 218-line file is the whole thing): https://github.com/dyladan/api-2-poc/blob/main/src/trace.ts
💯 this. OTLP is the only part of OpenTelemetry that I feel is fairly universal. Even that I don't entirely agree with, as it imposes some decisions, like span data living for the life of a span rather than being able to separate the start event from the end event, but it's universal enough to work for most use cases effectively. No spec is perfect; some weaknesses like that are to be expected.

If specs are defined to be interoperable then that also enables them to be replaceable. HTTP/1 made a bunch of mistakes that HTTP/2 tried to address; it made a bunch of its own mistakes which were then addressed by HTTP/3. The different HTTP protocol versions are hugely different, and yet it was feasible to make such a transition because they were always designed with interoperability in mind.
Mainly, yes. The SDK over-complicates things with a bunch of interfaces the majority of users will never need: samplers, propagators, id generators, span processors, exporters, etc. Sensible defaults for all of those are plenty for 90% of cases. Not a fan of the whole TracerProvider -> Tracer -> Span flow to do basically anything, but that's not a significant issue. It's mainly just a usability thing that it adds a bunch of steps the majority of users should never need to care about; Node.js could simply hold a Tracer internally and just have a module-exported

Your minimal version looks pretty good, actually. I particularly like that it's not holding span state on the span objects. That is, in my opinion, the most significant anti-pattern in the Observability segment, which the object-based model of OpenTelemetry has not helped with avoiding. Whatever we land on in core, we should not be holding data unnecessarily, and especially not making it retrievable in any way beyond just getting the OTLP output out the other end.
Adds support for OpenTelemetry tracing, as described in the included docs. Custom instrumentation is not yet supported.
This PR is intended to lay the groundwork for future PRs where API support for custom instrumentations is added. The OpenTelemetry SDK is explicitly not used, mainly to avoid bloat.
A programmatic API to enable and disable tracing is added as a new module, `node:otel`, but I'm very open to suggestions in terms of where to add it instead, keeping in mind that the intent is to eventually expose an API for custom instrumentation alongside it.

Disclaimer: As an experiment, Claude Code was used to build this PR, based on a hand-written prototype I built (but did not publish) about a year ago. Everything was thoroughly reviewed by me prior to PR submission, including some hand-edits.